Bike share programs like Philadelphia’s Indego Bike Share provide shared bicycles to people for short periods, facilitating an adaptable and sustainable mode of transit. Despite its convenience, the system often face issues in ensuring that bikes are available and docking spaces are open where needed, a process known as re-balancing. Effective re-balancing is vital for maintaining the functionality of the bike share system, as it impacts the user’s ability to access and return bikes conveniently.
One approach to re-balancing involves incentivizing users with rewards or discounts. This method engages users in the re-balancing process by offering them benefits for picking up or dropping off bikes at specific locations to ensure bikes are available across all stations. Such a system could use predictive models to identify when and where bikes are needed, based on user trends and station activity. The incentives could be adjusted in real-time, encouraging users to naturally redistribute bikes as part of their routine use of the service.
The precision of re-balancing efforts is dependent on the ability to forecast demand with reasonable accuracy. The appropriate time lag features to consider in the predictive model could range from a few hours to a full day or more, depending on the typical usage patterns observed. Shorter time lags may be ideal for swift corrections, while longer lags could allow for more strategic planning, such as overnight balancing. The prediction interval will shape both the incentive scheme for users and the logistical considerations, including whether there is a need for manual bike redistribution using trucks. The ultimate aim is to devise an automated system that reduces the need for manual re-balancing and upholds a high level of user experience and system dependability.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(mapview)
library(FNN)
library(caret)
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette5_2 <- c("#08519c", "#3182bd", "#6baed6", "#bdd7e7", "#eff3ff")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
dat <- read.csv("indego-trips-2023-q3-2.csv")
dat$timestamp <- mdy_hm(dat$start_time)
dat <- dat[dat$timestamp >= ymd_hms("2023-08-06T00:00:00Z") &
dat$timestamp <= ymd_hms("2023-09-10T00:00:00Z"), ]
dat2 <- dat %>%
mutate(interval60 = floor_date(ymd_hms(timestamp), unit = "hour"),
interval15 = floor_date(ymd_hms(timestamp), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
census <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2021,
state = "PA",
geometry = TRUE,
county=c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
tracts <-
census %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
dplyr::select(GEOID, geometry) %>%
st_sf
dat_census <- st_join(dat2 %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lon) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
tracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lon = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
st_join(., tracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(end_lon = unlist(map(geometry, 1)),
end_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)
Weather data plays a critical role in modeling for bike share systems due to its significant impact on rider behavior. Adverse weather can influence a person’s decision to use bike share services. Pleasant conditions are likely to encourage bike usage. Understanding these weather-related patterns is essential for accurate demand forecasting. Meanwhile, the heat index, often referred to as the “feels-like” temperature, is an important metric for modeling bike share demand because it combines air temperature with relative humidity to estimate the perceived outdoor temperature on the human body. This perceived temperature is more impactful on a person’s decision to engage in outdoor activities than the actual air temperature alone.
weather.Panel <-
riem_measures(station = "PHL", date_start = "2023-08-06", date_end = "2023-09-09") %>%
dplyr::select(valid, tmpf, p01i, sknt, relh)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt),
Humidity = max(relh),
Heat_Index = ifelse(Temperature >= 80,-42.379 + 2.04901523 * Temperature + 10.14333127 * Humidity -
0.22475541 * Temperature * Humidity
- 0.00683783 * Temperature^2 - 0.05481717 * Humidity^2 + 0.00122874 *
Temperature^2 * Humidity + 0.00085282 * Temperature
* Humidity^2 - 0.00000199 * Temperature^2 * Humidity^2, Temperature)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel)
## Rows: 817
## Columns: 6
## $ interval60 <dttm> 2023-08-06 00:00:00, 2023-08-06 01:00:00, 2023-08-06 02…
## $ Temperature <dbl> 74, 74, 71, 71, 71, 71, 75, 77, 81, 83, 85, 86, 86, 87, …
## $ Precipitation <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, …
## $ Wind_Speed <dbl> 4, 5, 4, 2, 3, 5, 5, 6, 7, 7, 7, 5, 8, 10, 9, 11, 11, 11…
## $ Humidity <dbl> 63.85, 63.85, 78.54, 78.54, 75.84, 75.84, 68.64, 66.50, …
## $ Heat_Index <dbl> 74.00000, 74.00000, 71.00000, 71.00000, 71.00000, 71.000…
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
ggplot(weather.Panel, aes(interval60,Humidity)) + geom_line() +
labs(title="Humidity", x="Hour", y="Humidity") + plotTheme,
ggplot(weather.Panel, aes(interval60,Heat_Index)) + geom_line() +
labs(title="Heat Index", x="Hour", y="Heat Index") + plotTheme,
top="Weather Data - Philadelphia PHL - August-September, 2023")
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Bike share trips per hr. Philadelphia, Aug-Sep, 2023",
x="Date",
y="Number of trips")+
plotTheme
## Warning: Removed 1 row containing missing values (`geom_line()`).
This plot shows the bike share trips per hour in Philadelphia from Aug 6
to Sep 10, 2023.
dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station, time_of_day) %>%
tally()%>%
group_by(start_station, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1)+
labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, Aug-Sep, 2023",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme
The plot demonstrates the mean number of hourly trips is high during the rush hours in the morning, noon and afternoon.
ggplot(dat_census %>%
group_by(interval60, start_station) %>%
tally())+
geom_histogram(aes(n), binwidth = 5)+
labs(title="Bike share trips per hr by station. Philadelphia, Aug-Sep, 2023",
x="Trip Counts",
y="Number of Stations")+
plotTheme
ggplot(dat_census %>% mutate(hour = hour(timestamp)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Bike share trips in Philadelphia, by day of the week, Aug-Sep, 2023",
x="Hour",
y="Trip Counts")+
plotTheme
## Warning: Removed 26 rows containing missing values (`geom_path()`).
ggplot(dat_census %>%
mutate(hour = hour(timestamp),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in Philadelphia - weekend vs weekday, Aug-Sep, 2023",
x="Hour",
y="Trip Counts")+
plotTheme
This plot shows that the bike share trips are concentrated during the morning and afternoon rush hour on weekdays.
ggplot()+
geom_sf(data = tracts %>%
st_transform(crs=4326))+
geom_point(data = dat_census %>%
mutate(hour = hour(timestamp),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
tally(),
aes(x=start_lon, y = start_lat, color = n),
fill = "transparent", alpha = 0.6, size = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station. Philadelphia, Aug-Sep, 2023")+
mapTheme
The maps demonstrates that the center city area has higher trips than the stations in the rest of the city.
# parks <-
# st_read("https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson",
# quiet = TRUE) %>%
# st_transform('ESRI:102729')
#
# parks4buff <- parks %>% dplyr::select(geometry)
#
# #parks within 1/2 mile
# dat_census$parks.Buffer <- dat_census %>%
# st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326) %>%
# st_transform('ESRI:102729') %>%
# st_buffer(2640) %>%
# aggregate(mutate(parks4buff, counter = 1),., sum) %>%
# pull(counter)
#
# dat_census <- dat_census %>% mutate(parks.Buffer = replace_na(parks.Buffer, 0))
#
# subway <- st_read("Highspeed_Stations.geojson", quiet = TRUE) %>%
# st_transform('ESRI:102729') %>%
# dplyr::select(geometry)
# trolley <- st_read("Trolley_Stations.geojson", quiet = TRUE) %>%
# st_transform('ESRI:102729') %>%
# dplyr::select(geometry)
# rr <- st_read("Regional_Rail_Stations.geojson", quiet = TRUE) %>%
# st_transform('ESRI:102729') %>%
# dplyr::select(geometry)
# septa <- rbind(subway, trolley, rr)
#
# dat_census <-
# dat_census %>%
# st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326, remove=FALSE) %>%
# st_transform('ESRI:102729')
#
# dat_census <-
# dat_census %>%
# mutate(
# septa_nn1 = nn_function(st_coordinates(dat_census),
# st_coordinates(septa), k = 1),
#
# septa_nn2 = nn_function(st_coordinates(dat_census),
# st_coordinates(septa), k = 2),
#
# septa_nn3 = nn_function(st_coordinates(dat_census),
# st_coordinates(septa), k = 3),
#
# septa_nn4 = nn_function(st_coordinates(dat_census),
# st_coordinates(septa), k = 4),
#
# septa_nn5 = nn_function(st_coordinates(dat_census),
# st_coordinates(septa), k = 5))
#
# dat_census <- dat_census %>%
# st_drop_geometry()
In constructing a time-series panel for the Indego bike share system in Philadelphia, we essentially craft a dataset where each row corresponds to a specific hour and day for each bike station. This structure enables us to maintain a comprehensive dataset that includes all time periods within the scope of our study.
#unique hours and unique stations
length(unique(dat_census$interval60)) * length(unique(dat_census$start_station))
## [1] 196560
study.panel <-
expand.grid(interval60=unique(dat_census$interval60),
start_station = unique(dat_census$start_station)) %>%
left_join(., dat_census %>%
select(start_station, Origin.Tract, start_lon, start_lat)%>%
distinct() %>%
group_by(start_station) %>%
slice(1))
nrow(study.panel)
## [1] 196560
ride.panel <-
dat_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, census %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
parks <-
st_read("https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson",
quiet = TRUE) %>%
st_transform('ESRI:102729')
parks4buff <- parks %>% dplyr::select(geometry)
#parks within 1/2 mile
ride.panel$parks.Buffer <- ride.panel %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326) %>%
st_transform('ESRI:102729') %>%
st_buffer(2640) %>%
aggregate(mutate(parks4buff, counter = 1),., sum) %>%
pull(counter)
ride.panel <- ride.panel %>% mutate(parks.Buffer = replace_na(parks.Buffer, 0))
subway <- st_read("Highspeed_Stations.geojson", quiet = TRUE) %>%
st_transform('ESRI:102729') %>%
dplyr::select(geometry)
trolley <- st_read("Trolley_Stations.geojson", quiet = TRUE) %>%
st_transform('ESRI:102729') %>%
dplyr::select(geometry)
rr <- st_read("Regional_Rail_Stations.geojson", quiet = TRUE) %>%
st_transform('ESRI:102729') %>%
dplyr::select(geometry)
septa <- rbind(subway, trolley, rr)
ride.panel <-
ride.panel %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326, remove=FALSE) %>%
st_transform('ESRI:102729')
ride.panel <-
ride.panel %>%
mutate(
septa_nn1 = nn_function(st_coordinates(ride.panel),
st_coordinates(septa), k = 1),
septa_nn2 = nn_function(st_coordinates(ride.panel),
st_coordinates(septa), k = 2),
septa_nn3 = nn_function(st_coordinates(ride.panel),
st_coordinates(septa), k = 3),
septa_nn4 = nn_function(st_coordinates(ride.panel),
st_coordinates(septa), k = 4),
septa_nn5 = nn_function(st_coordinates(ride.panel),
st_coordinates(septa), k = 5))
ride.panel <- ride.panel %>%
st_drop_geometry()
We created the nearest neighbor feature for the SEPTA rail stations in Philadelphia to assess how closely connected the stations are with the Indego bike stations. The plot shows that center city particularly has more closely connected SEPTA and Indego stations which contribute to seamless connection for users.
ggplot() +
geom_sf(data = tracts %>%
st_transform(crs=4326) , fill = "lightgray") +
geom_sf(data = ride.panel %>% group_by(start_station) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326, remove=FALSE),
aes(colour = q5(parks.Buffer)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(ride.panel,"parks.Buffer"),
name="Quintile\nBreaks") +
labs(title="PPR Parks within 1/2 Mile of Bike Station, Philadelphia",
caption = "Figure X. Data from OpenDataPhilly.org.") +
mapTheme +
guides(colour = guide_legend(override.aes = list(size=4)))
ggplot() +
geom_sf(data = tracts %>%
st_transform(crs=4326) , fill = "lightgray") +
geom_sf(data = ride.panel %>% group_by(start_station) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326, remove=FALSE),
aes(colour = q5(septa_nn3)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5_2,
labels=qBr(ride.panel,"septa_nn3"),
name="Quintile\nBreaks") +
labs(title="k=3 Nearest Neighbor Score for Septa Rail Stations, Philadelphia") +
mapTheme +
guides(colour = guide_legend(override.aes = list(size=4)))
ggplot(ride.panel %>%
group_by(Temperature) %>%
summarise(count = sum(Trip_Count))) +
geom_point(aes(x = Temperature, y = count))+
labs(title="Bike share trips per temperature, Philadelphia, Aug-Sep, 2023",
x="Temperature",
y="Number of trips")+
plotTheme
This plot shows that during the summer, people tend to use Indego bikes when the temperature is relatively more pleasant and less extreme.
ride.panel <-
ride.panel %>%
arrange(start_station, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 247,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 × 2
## Variable correlation
## <fct> <dbl>
## 1 lagHour 0.87
## 2 lag2Hours 0.65
## 3 lag3Hours 0.43
## 4 lag4Hours 0.23
## 5 lag12Hours -0.58
## 6 lag1day 0.88
We divide our dataset into two parts: one for training and the other for testing purposes. Utilizing the lm function, we construct five distinct linear models. Initially, our models are structured with only time-related variables. Subsequently, we expand upon these initial models by incorporating the lag data and various other predictive variables that we outlined in the preceding section.
ride.Train <- filter(ride.panel, week >= 34)
ride.Test <- filter(ride.panel, week < 34)
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Heat_Index, data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_station + dotw + Heat_Index + parks.Buffer + septa_nn3, data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Heat_Index + Precipitation +
parks.Buffer + septa_nn3,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Heat_Index + Precipitation +
parks.Buffer + septa_nn3 +
lagHour + lag2Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Heat_Index + Precipitation +
parks.Buffer + septa_nn3 +
lagHour + lag2Hours + lag1day + holidayLag + holiday,
data=ride.Train)
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions
## # A tibble: 10 × 8
## week data Regression Prediction Observed Absolute_Error MAE sd_AE
## <dbl> <list> <chr> <list> <list> <list> <dbl> <dbl>
## 1 32 <tibble> ATime_FE <dbl> <dbl> <dbl [39,312]> 0.819 0.946
## 2 33 <tibble> ATime_FE <dbl> <dbl> <dbl [39,312]> 0.849 0.974
## 3 32 <tibble> BSpace_FE <dbl> <dbl> <dbl [39,312]> 0.802 0.922
## 4 33 <tibble> BSpace_FE <dbl> <dbl> <dbl [39,312]> 0.836 0.954
## 5 32 <tibble> CTime_Space_FE <dbl> <dbl> <dbl [39,312]> 0.793 0.920
## 6 33 <tibble> CTime_Space_FE <dbl> <dbl> <dbl [39,312]> 0.818 0.947
## 7 32 <tibble> DTime_Space_FE… <dbl> <dbl> <dbl [39,312]> 0.667 0.834
## 8 33 <tibble> DTime_Space_FE… <dbl> <dbl> <dbl [39,312]> 0.679 0.841
## 9 32 <tibble> ETime_Space_FE… <dbl> <dbl> <dbl [39,312]> 0.671 0.830
## 10 33 <tibble> ETime_Space_FE… <dbl> <dbl> <dbl [39,312]> 0.685 0.837
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
As illustrated, the models that integrate time lags and exposure variables have lower MAE values, aligning more closely with the actual observed trends.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
plotTheme
In Philadelphia, the largest discrepancies occur within the core part of Center City. This sector registers the highest volume of bike trips, which correlates with the greater margin of error observed. Given the central role of the Central Business District (CBD) in Philadelphia, tailoring certain predictive features specifically to this key area could potentially refine the accuracy of the model outcomes.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon)) %>%
select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags") %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = census, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
labs(title="Mean Abs Error, Test Set, Model 4")+
mapTheme
In Philadelphia, there’s more bike usage on weekdays in comparison to weekends, leading to increased errors in our weekday forecasts. This pattern is also evident when examining daily time intervals, particularly during the morning and evening rush hours.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = census, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", size = 0.8, alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme
The plot of errors as a function of socio-economic variables demonstrate that there are more errors when median income and white people percentage are higher, while the errors are smaller when the percentage of people taking public transit is bigger.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
start_lat = map(data, pull, start_lat),
start_lon = map(data, pull, start_lon),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, start_station, start_lon,
start_lat, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
#geom_sf(data = chicagoCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme
In Philadelphia, ensuring that a model can be applied broadly and reliably is essential, which is where cross-validation comes into play. We conduct a 100 k-fold cross-validation on a subset of the data. The results indicate a Mean Absolute Error (MAE) of 0.68, which is relatively small.
## define the variables we want
reg.vars <- c("start_station", "hour(interval60)", "dotw", "Heat_Index", "Precipitation",
"lagHour", "lag2Hours", "lag1day")
ride.panel.cv <- ride.panel %>%
mutate(cvID = sample(round(nrow(ride.panel) / 100),
size=nrow(ride.panel), replace = TRUE))
## RUN REGRESSIONS
#k-fold
#reg.CV <- crossValidate(
# dataset = ride.panel,
# id = "cvID",
# dependentVariable = "Trip_Count",
# indVariables = reg.vars) %>%
# dplyr::select(cvID, Trip_Count, Prediction, geometry)
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.CV <-
train(Trip_Count ~ start_station + hour(interval60) + dotw +
Heat_Index + Precipitation + lagHour + lag2Hours + lag1day,
data = ride.panel,
method = "lm", trControl = fitControl, na.action = na.pass)
ggplot(reg.CV$resample, aes(x=MAE)) +
geom_histogram(fill = "#08519c", color = "white") +
labs(
title = "Distribution of MAE Across 100-Fold Cross Validation",
subtitle = "",
caption = "Figure 10")
Our model demonstrates a relatively smaller margin of error, primarily due to the robust predictive power of the time lag features. The conducted cross-validation confirms the model’s effectiveness which indicates that it performs just as well in practice as the results would have us believe. It also predicts well across different geographic areas of Philadelphia. Enhancements could be made by incorporating specific predictive factors for areas with high traffic, such as Center City.
Regarding the re-balancing strategy, the model’s precise forecasts of hourly and daily usage patterns greatly facilitate the prediction of specific departure and arrival bike trips. This predictive insight allows us to strategically offer incentives to users where necessary, helping to balance bike availability with user demand throughout Philadelphia’s bike share system.